cl/test: rnot: add cloud topic workloads #30435

Open
Lazin wants to merge 3 commits into redpanda-data:dev from Lazin:ct/shadown-linking-rnot-test

Lazin commented May 11, 2026

Adds cloud-topic and tiered-cloud-topic workloads to the shadow linking random node ops test so we exercise plain cloud and tiered_cloud storage modes alongside the existing si, compacted, and transactional workloads. Enables the explicit-only tiered_cloud_topics feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide; allow-lists the expected cloud-topics shutdown warnings.

Also fixes a bug in the write-at-offset code path of the cloud topics frontend: the frontend converted batches of every type into placeholders, which stalled the target cluster. The second commit in this PR contains the fix.

Finally, the test adds a new workload that constantly flips the topic between cloud and tiered_cloud modes. The goal is to have a mix of raft_data and ctp_placeholder batches in the partition being shadowed.
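To make the pass-through fix easier to picture, here is a minimal Python sketch of the idea, not the C++ code in frontend.cc. The `Batch` class, `split_batches`, and `interleave` are hypothetical names used only for illustration; the one rule taken from the PR is that only raft_data batches that are not control batches are turned into placeholders, and that the original input order is restored afterwards.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Batch:
    # Hypothetical stand-in for a record batch, not the Redpanda type.
    batch_type: str   # e.g. "raft_data" or "tx_fence"
    is_control: bool  # True for transaction commit/abort markers
    payload: bytes


def is_user_data(batch: Batch) -> bool:
    # Only user data batches are uploaded and replaced by placeholders.
    return batch.batch_type == "raft_data" and not batch.is_control


def split_batches(
    batches: Sequence[Batch],
) -> Tuple[List[Batch], List[Batch], List[bool]]:
    """Split the input into data and pass-through batches, remembering
    for every input position which of the two lists it went to."""
    data, passthrough, is_data_slot = [], [], []
    for batch in batches:
        flag = is_user_data(batch)
        is_data_slot.append(flag)
        (data if flag else passthrough).append(batch)
    return data, passthrough, is_data_slot


def interleave(
    placeholders: Sequence[object],
    passthrough: Sequence[object],
    is_data_slot: Sequence[bool],
) -> List[object]:
    """Rebuild the original order from the two lists and the slot flags."""
    data_slots = sum(is_data_slot)
    if len(placeholders) != data_slots:
        raise ValueError("placeholder count mismatch")
    if len(passthrough) != len(is_data_slot) - data_slots:
        raise ValueError("passthrough count mismatch")
    ph_it, pt_it = iter(placeholders), iter(passthrough)
    return [next(ph_it) if flag else next(pt_it) for flag in is_data_slot]
```

In this sketch the placeholders would be produced from the `data` list by whatever upload step applies; the control and fence batches never pass through that step, so their record keys stay intact.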

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings May 11, 2026 16:58

Copilot AI left a comment

Pull request overview

Extends the shadow-linking random node operations test to also exercise cloud-topics and tiered-cloud-topics storage modes during node operations, including enabling the necessary cluster config/feature flags and allow-listing expected cloud-topics shutdown/retry log messages.

Changes:

  • Enable cloud_topics_enabled on both clusters and activate the explicit-only tiered_cloud_topics feature before topic creation.
  • Increase preallocated client nodes and ducktape cluster node count to support two additional concurrent workloads.
  • Add two new workloads (cloud-topic, tiered-cloud-topic) using redpanda.storage.mode topic config and allow-list expected cloud-topics shadow-link logs.

Comment on lines 264 to 267
extra_rp_conf={
    "group_new_member_join_timeout": 3000,
    CLOUD_TOPICS_CONFIG_STR: True,
},
Comment on lines 269 to 284
@@ -276,6 +280,7 @@ def __init__(self, test_ctx: TestContext):
    "retention_local_trim_interval": 5000,
    "partition_autobalancing_tick_interval_ms": 2000,
    "group_new_member_join_timeout": 3000,
    CLOUD_TOPICS_CONFIG_STR: True,
},
Comment on lines +329 to +333
# enabled on both clusters before any tiered_cloud topic can be
# created.
self.source_cluster.service.set_feature_active(
    "tiered_cloud_topics", True, timeout_sec=30
)

vbotbuildovich commented May 11, 2026

CI test results

test results on build#84286

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1802-6e95-4755-8941-7102b14e57e6 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0121, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |
| FLAKY(PASS) | Datalake3rdPartyMaintenanceTest | test_e2e_basic | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1, "query_engine": "trino"} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-339b-4231-8b6b-79f43ff986f4 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=Datalake3rdPartyMaintenanceTest&test_method=test_e2e_basic |
| FLAKY(PASS) | WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84286#019e1801-33a0-4837-90cb-d329d573bfe5 | 9/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0975, p0=0.6415, reject_threshold=0.0100. adj_baseline=0.2649, p1=0.2121, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

test results on build#84317

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_with_restart | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0305, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_with_restart |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c904-4687-a455-376509e29248 | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7282-4ecf-a3d6-f8c8957b188f | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb4-c906-4a15-981b-e4328ad8375c | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
| FAIL | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84317#019e1bb6-7284-431b-ad2e-457808edc9a3 | 0/11 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |

test results on build#84334

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkBasicTests | test_link_creation_checks | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}} | integration | https://buildkite.com/redpanda/redpanda/builds/84334#019e1cca-dd8e-4c47-96c7-7b6c2f245bba | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0225, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkBasicTests&test_method=test_link_creation_checks |
| FLAKY(PASS) | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "basic"} | integration | https://buildkite.com/redpanda/redpanda/builds/84334#019e1ccb-528e-440b-9643-891f40a27ca9 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |

Lazin requested a review from pgellert May 11, 2026 18:58
Lazin force-pushed the ct/shadown-linking-rnot-test branch from 745e0db to c0ffaf5 Compare May 12, 2026 10:13

vbotbuildovich commented May 12, 2026

Retry command for Build#84317

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":false,"workload_set":"cloud_combos"}
tests/rptest/tests/shadow_linking_rnot_test.py::ShadowLinkingRandomOpsTest.test_node_operations@{"failures":true,"workload_set":"cloud_combos"}

Lazin force-pushed the ct/shadown-linking-rnot-test branch from 6c0c15f to b0d9665 Compare May 12, 2026 15:18
Lazin requested review from WillemKauf and dotnwat May 12, 2026 15:56

pgellert left a comment

Looks good to me, but I'll let someone from the cloud topics team approve

Comment thread src/v/cloud_topics/frontend/frontend.cc Outdated
Comment on lines +1283 to +1285
for (auto&& b : passthrough_batches) {
    final_batches.push_back(std::move(b));
}

I think this would be simpler here:

Suggested change
- for (auto&& b : passthrough_batches) {
-     final_batches.push_back(std::move(b));
- }
+ final_batches = std::move(passthrough_batches);

dotnwat requested review from andrwng and nvartolomei May 13, 2026 20:37
Comment thread src/v/cloud_topics/frontend/frontend.cc Outdated
Comment on lines +1250 to +1251
// (raft_configuration, tx_fence, control batches like transaction
// commit/abort markers, etc.) carry their payload in the record

I'm kind of confused by this -- I was under the impression that the only batches we expect here are data batches (which may include tx control batches, but not raft configuration). Is that the case? What non-raft_data batches do we expect here? If we're just being conservative here, could we update the comment to indicate that?

Comment thread src/v/cloud_topics/frontend/frontend.cc Outdated
// Per-input-position slots: true means the slot will be filled with a
// generated placeholder, false means it carries a pass-through batch
// already stored in `passthrough_batches`.
chunked_vector<bool> is_data_slot;

Maybe it makes sense to make this a set of non-data indexes. At least, I imagine we're more likely to have zero non-data batches in most cases.
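As a rough illustration of that suggestion, the sketch below tracks only the positions of the non-data batches instead of one flag per input batch. It is a Python sketch rather than the C++ in the PR, and `interleave_by_index` and its parameters are hypothetical names. It assumes the placeholders and the pass-through batches are each kept in their original relative order.

```python
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")


def interleave_by_index(
    placeholders: Sequence[T],
    passthrough: Sequence[T],
    non_data_indexes: Set[int],
) -> List[T]:
    """Restore input order given the input positions of pass-through batches."""
    if len(non_data_indexes) != len(passthrough):
        raise ValueError("passthrough count mismatch")
    # Common case: no pass-through batches, so nothing needs to be merged.
    if not non_data_indexes:
        return list(placeholders)
    total = len(placeholders) + len(passthrough)
    ph_it, pt_it = iter(placeholders), iter(passthrough)
    return [
        next(pt_it) if i in non_data_indexes else next(ph_it)
        for i in range(total)
    ]
```

With an empty index set the function returns the placeholders unchanged, which matches the expectation that most inputs contain no non-data batches.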

Comment thread src/v/cloud_topics/frontend/frontend.cc Outdated
Comment on lines +1362 to +1380
// Interleave generated placeholders with pass-through batches to
// restore the original input order.
auto ph_it = placeholders.batches.begin();
auto pt_it = passthrough_batches.begin();
for (bool is_data : is_data_slot) {
    if (is_data) {
        vassert(
          ph_it != placeholders.batches.end(),
          "placeholder count mismatch for {}",
          ntp());
        final_batches.push_back(std::move(*ph_it++));
    } else {
        vassert(
          pt_it != passthrough_batches.end(),
          "passthrough count mismatch for {}",
          ntp());
        final_batches.push_back(std::move(*pt_it++));
    }
}

I'm wondering if we need the bitmap at all. Do these batches have offsets assigned already? If so, could we merge them by offset?
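If the answer to that question is yes, the merge could look roughly like the Python sketch below. It assumes, and this is exactly the open point in the comment above, that every placeholder and every pass-through batch already carries its assigned base offset and that both lists are sorted by it. `OffsetBatch` and `merge_by_offset` are hypothetical names; `heapq.merge` is the standard-library k-way merge.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class OffsetBatch:
    # Hypothetical stand-in: a batch that already knows its base offset.
    base_offset: int
    kind: str  # "placeholder" or "passthrough"


def merge_by_offset(
    placeholders: Iterable[OffsetBatch],
    passthrough: Iterable[OffsetBatch],
) -> List[OffsetBatch]:
    """Merge two offset-sorted lists into one offset-sorted list."""
    return list(
        heapq.merge(placeholders, passthrough, key=lambda b: b.base_offset)
    )
```

This removes the per-position bookkeeping, at the cost of relying on the offsets being present and unique across the two lists.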

headers.reserve(batches.size());
for (const auto& batch : batches) {
    headers.push_back(batch.header());
    // Only user data batches (raft_data with !is_control()) are uploaded

Shower thought, write_at_offset might be a natural place to write directly as L1. In the context of shadow linking, we know the data is stable and has assigned offsets.

Contributor Author

I thought about this. The problem is that L1 objects are bounded by last stable offsets but the shadowing is bounded only by the high watermark. So the write_at_offset has to replicate batches that belong to transactions which are not yet committed + control batches.
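A small worked example may help with the two bounds. The Python sketch below uses the usual Kafka-style definition, in which the last stable offset is the first offset of the earliest open transaction, or the high watermark when no transaction is open. It is a simplified model and not Redpanda's offset accounting; the function names are hypothetical and both bounds are treated as exclusive upper limits.

```python
from typing import Iterable, List


def last_stable_offset(
    high_watermark: int, open_tx_first_offsets: Iterable[int]
) -> int:
    """First offset of the earliest open transaction, capped by the HWM."""
    return min([high_watermark, *open_tx_first_offsets])


def offsets_above_lso(
    offsets: Iterable[int],
    high_watermark: int,
    open_tx_first_offsets: Iterable[int],
) -> List[int]:
    """Offsets that shadowing replicates (below the HWM) but that are
    not yet stable (at or above the LSO)."""
    lso = last_stable_offset(high_watermark, open_tx_first_offsets)
    return [o for o in offsets if lso <= o < high_watermark]
```

For a log with offsets 0 to 9, a high watermark of 8, and an open transaction that started at offset 5, offsets 5, 6, and 7 fall in the gap: they must be replicated by write_at_offset even though they lie beyond the last stable offset.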

Lazin added 3 commits May 20, 2026 11:30
Adds cloud-topic and tiered-cloud-topic workloads to the shadow
linking random node ops test so we exercise plain cloud and
tiered_cloud storage modes alongside the existing si, compacted, and
transactional workloads. Enables the explicit-only tiered_cloud_topics
feature on both clusters and CLOUD_TOPICS_CONFIG_STR cluster-wide;
allow-lists the expected cloud-topics shutdown warnings.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

For storage.mode=cloud topics, replicate_at_offset previously sent
every input batch through stage_write/execute_write and wrapped each
one as a ctp_placeholder. The placeholder encoding drops the record
key, so for control records (e.g. transaction commit/abort markers)
the original key bytes are lost and rm_stm's parse_control_batch
throws std::out_of_range on the empty iobuf, halting state machine
apply at the marker offset.

Split the input list into user data batches (raft_data with
!is_control()) and pass-through batches (tx_fence, control batches,
etc.). Only data batches are uploaded to L0 and wrapped as
ctp_placeholders; the rest are forwarded to the write_at_offset_stm
unchanged. The original input ordering is preserved by interleaving
the generated placeholders with the pass-through batches.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Adds a new "flipping" workload_set matrix variant. A single workload
runs against flipping-storage-topic while a background daemon thread
toggles redpanda.storage.mode between cloud and tiered_cloud every 3
seconds on the source. Transient alter-config failures (leader
changes, partition movement) are logged and retried on the next tick;
the target config is not separately verified.

Wired through ClusterLinkingWorkloadSpec via optional
flip_storage_modes / flip_interval_seconds fields so other workloads
can opt in if needed.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
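The flipping behaviour described in the last commit message can be sketched as follows. This is a minimal Python illustration and not the code in shadow_linking_rnot_test.py: the `StorageModeFlipper` class and the `alter_config` callback are hypothetical, while the two modes and the 3 second default interval come from the commit message. It also assumes that a failed attempt is only logged and that the next tick moves on to the next mode.

```python
import itertools
import logging
import threading
from typing import Callable, Sequence

logger = logging.getLogger("storage_mode_flipper")


class StorageModeFlipper:
    """Toggle a topic's storage mode from a background daemon thread."""

    def __init__(
        self,
        alter_config: Callable[[str, str], None],  # hypothetical (topic, mode) callback
        topic: str,
        modes: Sequence[str] = ("cloud", "tiered_cloud"),
        interval_seconds: float = 3.0,
    ) -> None:
        self._alter_config = alter_config
        self._topic = topic
        self._modes = itertools.cycle(modes)
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            mode = next(self._modes)
            try:
                self._alter_config(self._topic, mode)
            except Exception as exc:
                # Transient failures (e.g. leader changes) are logged only;
                # the loop carries on with the next tick.
                logger.warning("alter-config to %s failed: %s", mode, exc)
```

Using an event for both the sleep and the stop signal lets `stop()` return promptly instead of waiting out a full interval.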
Lazin force-pushed the ct/shadown-linking-rnot-test branch from b0d9665 to 2a0a5ea Compare May 20, 2026 16:34